In this project, we are using a dataset to explore following questis:
Is there a significant difference in income between men and women?
Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?
Now, let’s begin!
We are using the NLSY79 (National Longitudinal Survey of Youth, 1979 cohort) data set. The NLSY79 data set contains survey responses on thousands of individuals who have been surveyed every one or two years starting in 1979.
Loading the whole dataset.
## Parsed with column specification:
## cols(
## .default = col_double()
## )
## See spec(...) for full column specifications.
# Change column names (mostly) to question name abbreviations
colnames(nlsy) <- c("VERSION_R25_2012",
"CASEID_1979",
"FAM-2A_1979",
"FAM-POB_1979",
"FAM-3_1979",
"FAM-3A_1979",
"FAM-RES_1979",
"FAM-6_1979",
"R_REL-1_COL_1979",
"SCHOOL-31_1979",
"MIL-6_1979",
"WOMENS-ROLES_000001_1979",
"WOMENS-ROLES_000002_1979",
"WOMENS-ROLES_000003_1979",
"WOMENS-ROLES_000004_1979",
"WOMENS-ROLES_000006_1979",
"WOMENS-ROLES_000007_1979",
"WOMENS-ROLES_000008_1979",
"EXP-OCC_1979",
"EXP-9_1979",
"race",
"gender",
"MARSTAT-KEY_1979",
"FAMSIZE_1979",
"POVSTATUS_1979",
"POLICE-1_1980",
"POLIC-1C_1980",
"POLICE-2_1980",
"ALCH-2_1983",
"DS-8_1984",
"DS-9_1984",
"Q13-5_TRUNC_REVISED_1990",
"POVSTATUS_1990",
"HGCREV90_1990",
"jobs.num",
"NUMCH90_1990",
"AGEYCH90_1990",
"DS-12_1998",
"DS-13_1998",
"INDALL-EMP.01_2000",
"CPSOCC80.01_2000",
"OCCSP-55I_CODE_2000",
"Q2-15B_2000",
"Q10-2_2000",
"Q13-5_TRUNC_REVISED_2000",
"FAMSIZE_2000",
"TNFI_TRUNC_2000",
"POVSTATUS_2000",
"MARSTAT-COL_2000",
"MARSTAT-KEY_2000",
"MO1M1B_XRND",
"Q2-10B~Y_2012",
"INDALL-EMP.01_2012",
"OCCALL-EMP.01_2012",
"OCCSP-55I_CODE_2012",
"Q2-15A_2012",
"Q12-6_2012",
"income",
"Q13-5_SR000001_2012",
"Q13-5_SR000002_2012",
"Q13-18_TRUNC_2012",
"Q13-18_SR000001_TRUNC_2012",
"FAMSIZE_2012",
"REGION_2012",
"HGC_2012",
"URBAN-RURAL_2012",
"JOBSNUM_2012")For the convenience of analysis, we would like to generate a sub-dataframe.
We will do this step by step:
This dataset contains lots of historical data, to be specific, we want to study on a cross sectional dataset of year 2012, so we are including some variables from 2012 survey mainly, and basic information of subjects such as race, gender in historical survey (which remain stable for each subjects). We will name our subdataset as gwgap(stands for “gender wage gap”).
Below are the selected variables and brief description:
income: total income from wage and salary in 2012
gender: gender, Male/Female
race: racial group, Hispanic/Black/Non-black
educ_year: education years, continus variable of highest grade completed
region: rigion of residence
num_job: Number of jobs ever reported at interview date
urban: urban or rural residence
industry: recoded based on “four sectors theory”: primary sector/secondary sector/tertiary sector/quaternary sector/puublic sector.
We will assign other factor variables(not are they at the present) whose value is negative to NA, we don’t simply omit them so far, beacuse we may want to think carefully about whether these values are informative.
gwgap[gwgap < 0] <- NA
# summay present distribution of NA in the dataframe
# table1 Data summary for NA values
table1 <- kable(summary(gwgap))
table1| income | gender | race | educ_year | region | num_job | urban | industry | |
|---|---|---|---|---|---|---|---|---|
| Min. : 45 | Min. :1.000 | Min. :1.000 | Min. : 3.00 | Min. :1.000 | Min. : 1.00 | Min. :0.000 | Min. : 170 | |
| 1st Qu.: 23000 | 1st Qu.:1.000 | 1st Qu.:2.000 | 1st Qu.:12.00 | 1st Qu.:2.000 | 1st Qu.: 7.00 | 1st Qu.:1.000 | 1st Qu.:4670 | |
| Median : 40000 | Median :1.000 | Median :3.000 | Median :13.00 | Median :3.000 | Median :11.00 | Median :1.000 | Median :7380 | |
| Mean : 54774 | Mean :1.496 | Mean :2.347 | Mean :13.67 | Mean :2.652 | Mean :12.37 | Mean :0.822 | Mean :6273 | |
| 3rd Qu.: 65000 | 3rd Qu.:2.000 | 3rd Qu.:3.000 | 3rd Qu.:16.00 | 3rd Qu.:3.000 | 3rd Qu.:16.00 | 3rd Qu.:1.000 | 3rd Qu.:8190 | |
| Max. :343830 | Max. :2.000 | Max. :3.000 | Max. :20.00 | Max. :4.000 | Max. :58.00 | Max. :2.000 | Max. :9990 | |
| NA | NA | NA | NA | NA’s :33 | NA | NA’s :33 | NA’s :105 |
As we can see, there are NA values in region, urban, and industry.
To move a step further, we want to know the distribution of these NA values in the dataframe, we will use function aggr in the package VIM here.
Plots above is a visualization of NA values in the data frame, there is an interesting pattern: missing value of geographical variables (region and urban) and those of industryare completely non-overlapping. Since we may want to detect geographysical features of gender wage gap in this report, by looking up the nlsy description documentant, we can probably guess that the reason of missing geographical variables is “Non-interview” or “valid skip”, and also because numbers are realtively small, we can infer that those values are not informative, so we just omit them.
gwgap <- subset(gwgap, !is.na(region))
gwgap <- subset(gwgap, !is.na(urban))
# we may also want to exclue unknown rural/urban residence
gwgap <- subset(gwgap, urban != 2)But for 105 missing industry variables, we cannot simmply omit them for 2 reasons, 1) they might be highly informative, as unknown/unreported industry may indicate a strong relationship with income. 2) the number of missing values are relatively large.
We will deal with these NA values below.
## 'data.frame': 5146 obs. of 8 variables:
## $ income : num 19000 35000 105000 40000 75000 102000 70000 60000 150000 115000 ...
## $ gender : num 2 2 1 2 1 2 2 1 1 2 ...
## $ race : num 3 3 3 3 3 3 3 3 3 3 ...
## $ educ_year: num 12 12 16 14 14 19 13 13 13 17 ...
## $ region : num 1 4 2 1 1 1 1 1 4 1 ...
## $ num_job : num 3 17 12 13 5 24 12 14 4 12 ...
## $ urban : num 1 1 1 1 1 1 1 1 1 1 ...
## $ industry : num 8680 8990 7870 6870 8770 6990 7270 9570 4170 7290 ...
# some simple releveling
gwgap <- mutate(gwgap,
gender = recode_factor(gender,
`1` = "Male",# Male group is baseline
`2` = "Female"),
race = recode_factor(race,
`3` = "Other", # Other group is baseline
`2` = "Black",
`1` = "Hispanic"),
region = recode_factor(region,
`1` = "Northeast", # North east is the baseline
`2` = "Northcentral",
`3` = "South",
`4` = "West"),
urban = recode_factor(urban,
`1` = "Urban", # urabn is the baseline
`0` = "Rural"))In classical human capital theory, education is one of the most predictor of future income, so we want to include it in our analysis. Here the education is continuous variables indicating eductional years, as we all know the effect of education may not follow a continuous pattern, this is some undermining points of regression model below. We will mutate education as a factor variable afterwards. But in the final regressional model, we will still use numerical education year varible, assumpting human capital accumulation is continuous along with education. Any way, let’s look at educ_year:
##
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 6 4 2 12 13 39 84 69 100 2120 474 549 291 712 181
## 18 19 20
## 254 104 132
As we can see, our sub data frame doesn’t have special values such as “93” indicating PRE-KINDERGARTEN, etc. As assumpted above, we don’t need to do anything to educ_year at the present.
There are many levels in industry and they are meaningless, for the convience of analysis, we want to refactor the variable according to the “four sectors economy theory”, with a little change.
Some related links: https://www.economicshelp.org/blog/12436/concepts/sectors-economy/ By doing this, this variable can be used to analysis gender wage gap across different sectors of economy.
gwgap[is.na(gwgap)] <- 0
gwgap$industry <- cut(gwgap$industry,
breaks = c(0,130, 330, 530, 730, 1030, 4030, 4630, 6030, 6430, 6830, 7030, 7230, 7530, 7830, 7930, 8530, 8630, 8730, 9330, 9830, 9930, 9980, 9995), right = FALSE, labels = c(0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 3, 4, 3, 4, 3, 3, 3, 3, 5, 5, 5, 5))
gwgap <- mutate(gwgap,
industry = recode_factor(industry,
`3` = "Tertiary",
`1` = "Primary",
`2` = "Secondary",
`4` = "Quaternary",
`5` = "Public",
`0` = "Unknown"))| income | gender | race | educ_year | region | num_job | urban | industry | |
|---|---|---|---|---|---|---|---|---|
| Min. : 45 | Male :2593 | Other :2746 | Min. : 3.00 | Northeast : 794 | Min. : 1.00 | Urban:4117 | Tertiary :2448 | |
| 1st Qu.: 23500 | Female:2553 | Black :1452 | 1st Qu.:12.00 | Northcentral:1247 | 1st Qu.: 7.00 | Rural:1029 | Primary : 87 | |
| Median : 40000 | NA | Hispanic: 948 | Median :13.00 | South :2097 | Median :11.00 | NA | Secondary :1009 | |
| Mean : 54932 | NA | NA | Mean :13.68 | West :1008 | Mean :12.37 | NA | Quaternary:1106 | |
| 3rd Qu.: 65000 | NA | NA | 3rd Qu.:16.00 | NA | 3rd Qu.:16.00 | NA | Public : 394 | |
| Max. :343830 | NA | NA | Max. :20.00 | NA | Max. :58.00 | NA | Unknown : 102 |
From the table above, We can get an overview of the variables in our data. We can see that gender is almost equally distributed in our sample dataset, which is a good news. Proportion of Race is similar as general population in US, Urban/Rural proportion is about 4 : 1. For other variables, let’s get a closer look by plotting.
# plot1: Income Distribution
plot1<- ggplot(gwgap, aes(x = income)) + geom_histogram() + xlab("Reported Income 2012") + ylab("Frequency") +
ggtitle("Income Ditribution") + stat_bin(binwidth = 50)
plot1## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As we can see, income distribution is right-skewed, and there are some outliers in far most side, which is top coded value. We will deal with them later.
# plot2: Region Distribution
plot2 <- ggplot(gwgap, aes(x = "", y = region, fill = region)) +
geom_bar(stat = "identity", width = 1) + coord_polar(theta = "y") +
labs(x = "", y = "", title = "") +
theme(axis.text.x = element_blank(),axis.text.y = element_blank(), axis.ticks = element_blank(),legend.title = element_blank(), legend.position = "top") + ggtitle("Region Distribution")
plot2Above is the distribution of subjects residence region, south is the most common. (But Note we chose Northeast as baseline group as to be more standard for the analysis. )
# plot3: Job numbers Distribution
plot3 <- ggplot(gwgap, aes(x = num_job)) + geom_histogram() + xlab("Number of jobs") + ylab("Frequency") + ggtitle("Job number ditribution")
plot3## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Job number distribution is as expected, right-skewed.
# plot4: Indutry Distribution
plot4 <- ggplot(gwgap, aes(x = industry)) + geom_histogram(stat = "count") + xlab("Industrial Sector") + ylab("Frequency") + ggtitle("Industy ditribution")## Warning: Ignoring unknown parameters: binwidth, bins, pad
As we can see, most people works in Tertiary Industry.
# plot5: Education Distribution
plot5 <- ggplot(gwgap, aes(x = educ_year)) + geom_histogram() + xlab("Education years") + ylab("Frequency") + ggtitle("Education ditribution") + stat_bin(binwidth = 1)
plot5## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As we can see, most are high school graduate. We will use this pattern to category education as a factor varible to conduct a further analysis in next part.
Since we are mainly interested in the gender wage gap across various factors, first we will look into some plot revealing this question.
# plot6: Overall Gender Wage Gap
plot6 <- ggplot(gwgap, aes(x = gender, y = income, fill = gender)) + geom_boxplot() + xlab("Gender") + ylab("Income") + ggtitle("Overall Gender Wage Gap")
ggplotly(plot6)As we can see from the box plot, there is a obvious difference between male and female.
# table3 High income proportion
# mutate column: high.income
gwgap <- mutate(gwgap,
high.income = as.numeric(income > 50000))
# construct prop table
high.earner.table <- gwgap%>%
dplyr::group_by(gender)%>%
dplyr::summarize(count = dplyr::n(), high.earn.rate = round(mean(high.income),4))
high.earner.table <- kable(high.earner.table)As we can see in the table, male are more likely to be in a higher income level(>50k), this may indicating some potential factors contributing to the gender wage gap. We may try to reveal it later.
# plot7: Gender wage gap broken down by income level
plot7 <- ggplot(gwgap, aes(x = gender, y = income, fill = gender)) + geom_boxplot() + xlab("Gender") + ylab("Income") + ggtitle("Gender Wage by Income Level") + facet_wrap( . ~ high.income)
ggplotly(plot7)As we can say, in both income level, gender wage gap is apparent.
# plot8: Gender wage gap broken down by region of residence
plot8 <- ggplot(gwgap, aes(x = gender, y = income, fill = gender)) + geom_boxplot() + xlab("Region of Residence") + ylab("Income") + ggtitle("Gender Wage gap break down by region") + facet_wrap( .~region)
ggplotly(plot8)As we can see, gender wage gap seems obvious again, and appears even across regions.
# plot9: Gender wage gap broken down by urban/rural residence
plot9 <- ggplot(gwgap, aes(x = gender, y = income, fill = gender)) + geom_boxplot() + xlab("Urban/Rural Residence") + ylab("Income") + ggtitle("Gender Wage gap broken down by urban/rural residence") + facet_wrap( .~urban)
ggplotly(plot9)As we expected, gender wag gap is signficiant in both area.
# plot10 Gender wage gap break down by industry
plot13 <- ggplot(gwgap, aes(x = gender, y = income, fill = gender)) + geom_boxplot() + xlab("Industry") + ylab("Income") + ggtitle("Gender Wage gap break down by industry") + facet_wrap( .~industry)
ggplotly(plot13)It’s interesting, since the gap seems varies a little across different industry, in the category “Unknown”, there seems no difference. Industry can be a interesting variable and I want to explore it further in the next part.
For race variable, we have discussed it a lot in the class, so I will not include the plot in the html report.
As mentioned, we are assuming that education variable varies continuously, but in fact it is not. Apparently, whether entering the college make a great matter, so grade 13 might be a threshold. Based on this, I recoded a fact varible, college.entry, and constructed a prop table. And then, plot a box plot to see the gender wage gap between different education completing status groups.
We found that even though women are more likely to enter into the college than man on average, the gender wage gap actually increased after entering into college. Indicating college education benefit men more than women.
I found that, for the people having 10 jobs or more, gender wage gap seems smaller that those have less job numbers.
And income and job numbers shows a sightly negative relationship, probably not linear, which might need more research.
Plots and tables about race, education and number of jobs are not inclued in the output file but you can check it in the source code rmd file.
In this part, we will do some regressions and explore a step further to look at favtors that influencing gender wage gap.
To begin with, we need to consider the potential problem of colinearity, by doing some correlation analysis here.
We are using a function from package “PerformanceAnalytics”, which can directly draw a correlation plot.
Here we are only review correlationship between numeric vairables.
# extract a sub-dataset of numeric variables
corr.table <- gwgap[, c("educ_year", "num_job")]
# plot14 Corr-analysis
plot14 <- chart.Correlation(corr.table, histogram = TRUE, PCH = 19)As we can see, we can not say that education years and number of jobs are not correlated, they have a strong relationship. But actually in my regression, I find both variables signficant. But base on current knowedge, I could not know the reason of this as well as how to deal with it, so I included them in the model anyway.
For the problmes of top-coded values, recall that there are a little peak in the far right side in income distribution plot. These values can have a big impact on the model. So we want to deal with this.
# plot15 Scatter plot, income ~ education years
plot15 <- ggplot(gwgap, aes(x = educ_year, y = income)) + geom_point() +stat_smooth(method = "lm") + ggtitle("income ~ education years")
plot15# As we can see, those top-coded value make regression line steeper like a teeterboard, based on the knowledge about income inequality situation in US, at least we can see high variance near upper bound are not caused by ecucation. So we may want to drop these subjects to have a more unbiased estimation of more general popolation.
# exclude top-coded values
lm.top <- lm(income ~ educ_year + gender + race + num_job + region + urban + industry, gwgap)# store the regression before mutating dataset.
topcoded.income <- max(gwgap$income)
topcoded.income## [1] 343830
gwgap <- subset(gwgap, income < topcoded.income)
# Now let's do the plotting again
# plot16 Scatter plot, income ~ education years
plot16 <- ggplot(gwgap, aes(x = educ_year, y = income)) + geom_point() +stat_smooth(method = "lm") + ggtitle("income ~ education years")
plot16As we can see, fitted line become more balanced after excluding top-coded values.
Now let’s do the comparision by look into the out put of regression.
# this is the overall regression after excluding top-coded values
lm.overall <- lm(income ~ educ_year + gender + race + num_job + region + urban + industry, gwgap)
# Regression output before excluding top-coded values
table.top <- kable(summary(lm.top)$coef, digits = c(3, 3, 3, 4), format = 'markdown')
table.top| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -27425.285 | 4897.108 | -5.600 | 0.0000 |
| educ_year | 8780.012 | 304.089 | 28.873 | 0.0000 |
| genderFemale | -31168.949 | 1490.484 | -20.912 | 0.0000 |
| raceBlack | -16829.715 | 1771.255 | -9.502 | 0.0000 |
| raceHispanic | -4857.612 | 2096.455 | -2.317 | 0.0205 |
| num_job | -1088.122 | 105.143 | -10.349 | 0.0000 |
| regionNorthcentral | -7631.277 | 2339.049 | -3.263 | 0.0011 |
| regionSouth | -4029.257 | 2167.072 | -1.859 | 0.0630 |
| regionWest | -3345.379 | 2500.935 | -1.338 | 0.1811 |
| urbanRural | -7741.045 | 1865.723 | -4.149 | 0.0000 |
| industryPrimary | 2836.328 | 5637.491 | 0.503 | 0.6149 |
| industrySecondary | 3880.315 | 1990.390 | 1.950 | 0.0513 |
| industryQuaternary | 6466.277 | 1932.436 | 3.346 | 0.0008 |
| industryPublic | 3847.111 | 2794.268 | 1.377 | 0.1686 |
| industryUnknown | 2639.873 | 5173.980 | 0.510 | 0.6099 |
# Regression output after excluding top-coded values
table.overall <- kable(summary(lm.overall)$coef, digits = c(3, 3, 3, 4), format = 'markdown')
table.overall| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -5799.263 | 2763.288 | -2.099 | 0.0359 |
| educ_year | 5381.374 | 174.726 | 30.799 | 0.0000 |
| genderFemale | -16724.922 | 842.440 | -19.853 | 0.0000 |
| raceBlack | -9710.468 | 991.637 | -9.792 | 0.0000 |
| raceHispanic | -2830.928 | 1176.850 | -2.406 | 0.0162 |
| num_job | -742.275 | 58.757 | -12.633 | 0.0000 |
| regionNorthcentral | -5253.426 | 1319.805 | -3.980 | 0.0001 |
| regionSouth | -3000.390 | 1223.063 | -2.453 | 0.0142 |
| regionWest | -902.031 | 1410.856 | -0.639 | 0.5226 |
| urbanRural | -4040.377 | 1044.889 | -3.867 | 0.0001 |
| industryPrimary | 8131.177 | 3143.531 | 2.587 | 0.0097 |
| industrySecondary | 7842.333 | 1118.715 | 7.010 | 0.0000 |
| industryQuaternary | 5159.093 | 1096.290 | 4.706 | 0.0000 |
| industryPublic | 13045.643 | 1555.830 | 8.385 | 0.0000 |
| industryUnknown | 6742.220 | 2897.005 | 2.327 | 0.0200 |
As we can see in the two tables above, nearly all of cofficients become signficant after droping top-coded values.
As we notice above, region west is not significant, so we may ask the question: is variable region effective/signficaiant when we add it into the model?
We will do an anova test to see if including region is significant.
# updating the regression
lm.no.region <- update(lm.overall, . ~ . - region)
# Regression Output
table.no.region <- kable(summary(lm.no.region)$coef, digits = c(3, 3, 3, 4), format = 'markdown')
table.no.region| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -8632.056 | 2628.379 | -3.284 | 0.0010 |
| educ_year | 5391.895 | 174.998 | 30.811 | 0.0000 |
| genderFemale | -16854.317 | 843.162 | -19.989 | 0.0000 |
| raceBlack | -9779.126 | 953.268 | -10.259 | 0.0000 |
| raceHispanic | -1840.222 | 1118.128 | -1.646 | 0.0999 |
| num_job | -733.932 | 58.542 | -12.537 | 0.0000 |
| urbanRural | -4440.054 | 1035.694 | -4.287 | 0.0000 |
| industryPrimary | 7712.301 | 3146.530 | 2.451 | 0.0143 |
| industrySecondary | 7387.640 | 1115.468 | 6.623 | 0.0000 |
| industryQuaternary | 5040.822 | 1097.305 | 4.594 | 0.0000 |
| industryPublic | 13163.886 | 1556.440 | 8.458 | 0.0000 |
| industryUnknown | 6923.084 | 2899.706 | 2.388 | 0.0170 |
Output seems not bad except that coefficent of race hispanic become non-significant.
Now let’s do the anova test:
## Analysis of Variance Table
##
## Model 1: income ~ educ_year + gender + race + num_job + urban + industry
## Model 2: income ~ educ_year + gender + race + num_job + region + urban +
## industry
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4992 3993144369162
## 2 4989 3977556402529 3 15587966634 6.5173 0.0002139 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see from the test result, it is highly significant. So we should include region in our regression.
Now let’s do some interpretation of the regression model:
Seen from regression diagnostic plots, we can see that the residuals did not have a completely constant varience, which indicating potential non-linear relationship between income and other variables.
From the normal qq plot, we can see clearly that on the higher side of the income, the residuals did not follow a normally distributed pattern, but for the rest part, it is pretty good.
We can draw similar conclusion about non-linear relationship from scale-location plot.
For the residuals vs leverage plot, because we have excluded top-coded values, the result seems pretty good.
Overall, model fitting is significant(F-statistics is 138.5671658 and p_value is nearly 0, even the r squared(0.2799765) is not high, which indicates that a large part of variance remain unexplained, based on a knowledge of standard wage model, the major missing variables could be: working years, job training reperience, etc.
Below are the intepretation of some cofficients:
Intercept: in this model, it indicates that -5799.26$ is the basic earning of such a featured person: male, non-black or hispanic, live in northeast of the country, with no education, live in urban area, having no job experience, expecting working in tieratiry industry. It is very hard to find such a person in reality, but due to the fact the income level here is negative, the interperation itself is somehow meaningless.
Race: As we can see, all else held equal, comparing to the baseline group, race black and race hispanic earns signficantly less the other group, respectively: black (-9710.47$), hispanic (-2830.93$).
num_job: all else held equal, one more job numbers is associated with a deduction of -742.27$ on average, frequently changing job may be indicators of some non-reliable characters of one person, which may result in lower income.
region: all else held equal, we can say that comparing to the Northeast group, people in other South and Northcentral part of the country earn less, for the west part, the coef is not significant, this might because West part is as developed as Northeast in recent years, but the variance of income could be large, because there are still many rural areas in West.
urban/rural: the coef is as expected, people in rural area earns -4040.38$ less on average.
industry: the baseline group here is tertiary sector, people in other industry all earn more, which is not interesting. This may be related to the denefination of industry in our dataset here. An impressive pattern is that, people in public sector earn 13045.64$ more than baseline group.
# constructing a table for analyzing gender wage gap between industry groups
industry.plot.df <- gwgap%>%
group_by(gender, industry)%>%
summarise(mean.income = mean(income),
lower = t.test(income)$conf.int[1],
upper = t.test(income)$conf.int[2])
kable(industry.plot.df) | gender | industry | mean.income | lower | upper |
|---|---|---|---|---|
| Male | Tertiary | 45455.43 | 43510.71 | 47400.15 |
| Male | Primary | 54169.11 | 44844.43 | 63493.80 |
| Male | Secondary | 54678.83 | 52277.99 | 57079.66 |
| Male | Quaternary | 70768.63 | 66692.63 | 74844.62 |
| Male | Public | 70814.03 | 65521.89 | 76106.17 |
| Male | Unknown | 50681.80 | 40588.48 | 60775.12 |
| Female | Tertiary | 34780.21 | 33297.74 | 36262.68 |
| Female | Primary | 31955.25 | 12695.58 | 51214.92 |
| Female | Secondary | 40146.41 | 36593.14 | 43699.68 |
| Female | Quaternary | 45038.05 | 42788.03 | 47288.08 |
| Female | Public | 49635.20 | 45843.26 | 53427.14 |
| Female | Unknown | 48986.87 | 38671.00 | 59302.73 |
Using the table above, we can draw some graphs.
# ploting mean income across industry groups and adding error bars
industry.plot <- ggplot(data = industry.plot.df, aes(x = industry, y = mean.income, fill = gender))
plot18 <- industry.plot + geom_bar(position = "dodge", stat = "identity") + geom_errorbar(aes(ymin = lower, ymax = upper), width = .2, position = position_dodge(0.9)) + ylab("average income")
ggplotly(plot18)As we can see in the graph, there are wage gap in nearly all industry groups except “Unknown”, but for some group, the confidence interval of mean income is wide, indicating that the gender wage gap may vary dramatically.
To be more clear, we will look directly into the gender wage gap of industry groups and draw the error bar to see if they are significant.
# before ploting, we will simply calculating and see if they are significant:
industry.analysis <- gwgap%>%
group_by(industry)%>%
summarise(diff.sig = round(t.test(income ~ gender)$p.value, 4))
kable(industry.analysis)| industry | diff.sig |
|---|---|
| Tertiary | 0.0000 |
| Primary | 0.0391 |
| Secondary | 0.0000 |
| Quaternary | 0.0000 |
| Public | 0.0000 |
| Unknown | 0.8139 |
As we can see, in group primary and unknown, gender wage gap is not significant, so far.
continue with plotting:
# here are another sub data frame used to plot the gender wage gap
diff.plot.df <- gwgap%>%
group_by(industry)%>%
summarise(income.diff = t.test(income ~ gender)$estimate[1] - t.test(income ~ gender)$estimate[2],
lower = t.test(income ~ gender)$conf.int[1],
upper = t.test(income ~ gender)$conf.int[2])
kable(diff.plot.df)| industry | income.diff | lower | upper |
|---|---|---|---|
| Tertiary | 10675.220 | 8231.055 | 13119.38 |
| Primary | 22213.864 | 1204.646 | 43223.08 |
| Secondary | 14532.416 | 10250.566 | 18814.27 |
| Quaternary | 25730.574 | 21079.106 | 30382.04 |
| Public | 21178.838 | 14686.733 | 27670.94 |
| Unknown | 1694.933 | -12555.061 | 15944.93 |
# Now we can draw the gender wage gap of industry groups and error bars
diff.plot <- ggplot(data = diff.plot.df, aes(x = industry, y = income.diff, fill = industry))
plot19 <- diff.plot + geom_bar(position = "dodge", stat = "identity") + geom_errorbar(aes(ymin = lower, ymax = upper), width = .2, position = position_dodge(0.9)) + ylab("average gender wage gap")
ggplotly(plot19)The plot is saying the some conclusion but more vivid.
industry interacts with gender to affect income?An advantage of analyzing interaction terms is we can decompose the effect that industry has on gender wage gap by control all other substantial variabls that influcencing income.
# updating the regression
lm.interact <- update(lm.overall, . ~ . + gender * industry)
# table of output
interact.table <- kable(summary(lm.interact)$coef, digits = c(3, 3, 3, 4), format = 'markdown')
interact.table| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -6525.163 | 2784.709 | -2.343 | 0.0192 |
| educ_year | 5319.822 | 175.122 | 30.378 | 0.0000 |
| genderFemale | -14325.203 | 1170.200 | -12.242 | 0.0000 |
| raceBlack | -9635.984 | 990.867 | -9.725 | 0.0000 |
| raceHispanic | -2867.767 | 1175.324 | -2.440 | 0.0147 |
| num_job | -731.217 | 58.756 | -12.445 | 0.0000 |
| regionNorthcentral | -5262.700 | 1318.276 | -3.992 | 0.0001 |
| regionSouth | -3007.389 | 1222.301 | -2.460 | 0.0139 |
| regionWest | -784.401 | 1410.027 | -0.556 | 0.5780 |
| urbanRural | -3974.361 | 1044.728 | -3.804 | 0.0001 |
| industryPrimary | 10784.818 | 3508.788 | 3.074 | 0.0021 |
| industrySecondary | 8868.478 | 1355.592 | 6.542 | 0.0000 |
| industryQuaternary | 10777.671 | 1785.127 | 6.037 | 0.0000 |
| industryPublic | 15705.518 | 2308.690 | 6.803 | 0.0000 |
| industryUnknown | 5144.453 | 3911.232 | 1.315 | 0.1885 |
| genderFemale:industryPrimary | -9675.723 | 7906.926 | -1.224 | 0.2211 |
| genderFemale:industrySecondary | -1037.806 | 2431.606 | -0.427 | 0.6695 |
| genderFemale:industryQuaternary | -8659.843 | 2184.793 | -3.964 | 0.0001 |
| genderFemale:industryPublic | -4654.216 | 3093.511 | -1.505 | 0.1325 |
| genderFemale:industryUnknown | 4209.085 | 5795.431 | 0.726 | 0.4677 |
It’s interesting, only one coefficient of interaction terms is signficant.
Is it ineffective to include the interection term?
Let’s conduct anova analysis.
## Analysis of Variance Table
##
## Model 1: income ~ educ_year + gender + race + num_job + region + urban +
## industry
## Model 2: income ~ educ_year + gender + race + num_job + region + urban +
## industry + gender:industry
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4989 3977556402529
## 2 4984 3962515384400 5 15041018129 3.7837 0.002016 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
See! It’s significant, so what does this mean?
Well, considering plotting analysis above, we can infer that though the gender wage gap is significant in many industry groups, after controling for some variables in the regression, we find that in this quaternary sector group, the gender wage gap still remain signficant. We can infer to things from this, difference between industry groups may be accounted for by other varibles, which our model successfully captured, so most of the gap become non-signficant, but there is still, for some reason, the gender wage gap is significant in quaternary group. And the reason is unclear, we have to do further research to find out why.
The intepration of the coffecient is that all others held equal, men in quaternary earns 8660$ more than women on average.
The difinenation of “Quaternary sector” is: a knowledge-based part of the economy, which typically includes knowledge-oriented economic sectors such as information technology, media, research and development; information-based services such as information-generation and information-sharing; and knowledge-based services such as consultation, education, financial planning, blogging, and designing.
Here comes the most interesting thing! Above are almost exactly the business or jobs we would go to after graduation.
Overall, we can draw the conclusion that gender wage gap is almost universal (when comparing man and women in same condition), at least for the factors we included in the model.
Another fact is there are many factors that can have interact or overlapping effect on gender wage gap, some of the mechanism we do know the reason, for some other, like the one we explored in the report “industry”, the reasons of the significance of the interaction term needs further more research on.
I have like 70% confidence on my analysis, one most important deduction is that we didn’t include possibily most powerful factor, which is job-realted experience or capacity level, that influences income. And the regression model itself can also be potentially improved by adding or replacing some varibles or interaction terms. What’s more, as we mentioned before, non-linear pattern of the effect of education can be a potential weak point of the analysis.